ABSTRACT

A presentation of the 40-year history of 
psychometrics is given with comments about needed trends for the 
future. Computers have radically changed the time required for data 
processing. In testing, many promising developments, such as 
Kristof’s reliability for vector variables, latent class and latent
structure models, one-factor ratio scale in testing and Bayesian
procedures, are still largely in the theoretical field. Interest in 
scaling did not become important until Messick applied methods 
previously developed to attitude scales in 1956. Multidimensional 
scaling techniques have recently been utilized in a number of 
research areas and applied fields. Factor analysis theories are 
reasonably well developed. Applications to aptitude tests have been 
made, but have been only sketchily used in other fields in which they 
would be extremely valuable, such as economics, sociology, and 
physiology. In the field of mathematical learning theory, work needs 
to be done for individual learning curves and in comparing various 
stochastic and continuous models. Quantitative psychology has moved a
long way in 40 years. (DJ) 



ED 069706 



RM-72-8

RESEARCH 

MEMORANDUM 



LOOKING BACK AND LOOKING AHEAD IN PSYCHOMETRICS 
Harold Gulliksen 







U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.







This is a revised version of an after dinner talk 
given on March 30, 1972, at the spring meeting of 
the Psychometric Society, Princeton, New Jersey. 



Educational Testing Service 
Princeton, New Jersey 
July 1972 







Looking Back and Looking Ahead in Psychometrics 
Harold Gulliksen 

In presenting the history, I shall do so in terms of the various 
investigators who worked in this area, even though I would, in general, agree 
with James Thurber in doubting the great man theory of the development of 
science. He points out that some people think it was a great day, and a
critical event when Benjamin Franklin sent up his kite and brought down 
lightning, demonstrating its fundamental similarity to electricity (Thurber, 
1937). Others feel, however, that this event was not particularly critical, 
believing that if Franklin had not done this, somebody else would have made 
the same discovery; and we can see that this was exactly what happened with 
the harnessing of steam and the invention of the gas engine. Franklin didn't 
make these discoveries, and sure enough somebody else did. Q.E.D. 

"Looking back in Psychometrics" for me goes to 1929 when I was a graduate
student at Ohio State University and Thurstone gave a seminar on the theory 
and applications of the law of comparative judgment and the method of paired 
comparisons. Numerous topics, such as art, esthetics, ethics, subjective 
values, etc., had previously been dismissed with "What can you do about field
or topic X? It is all a matter of opinion, and opinions disagree." I was
tremendously impressed by the idea that now there was a clear-cut theory and 
experimental procedure for a rigorous treatment of those areas that are 
entirely a matter of opinion, and it was essential for the use of the method 
that the opinions should disagree (see Thurstone, 1959, for a collection of
his articles on scaling). 
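
As a rough illustration of why the method needs disagreement, the sketch below scales paired-comparison proportions under the simplest form of the law of comparative judgment (usually called Case V). The function name, the example proportions, and the choice of Case V are assumptions made for illustration; they are not taken from the seminar or from Thurstone's papers.

```python
from statistics import NormalDist, StatisticsError

def case_v_scale(p):
    """Case V scale values from paired-comparison proportions.

    p[i][j] is the proportion of judges who rated stimulus j above stimulus i.
    Diagonal entries are ignored (a stimulus is not compared with itself).
    """
    n = len(p)
    to_z = NormalDist().inv_cdf          # proportion -> unit normal deviate
    scale = []
    for j in range(n):
        z = [to_z(p[i][j]) for i in range(n) if i != j]
        scale.append(sum(z) / n)         # the omitted diagonal would add z = 0
    lowest = min(scale)
    return [s - lowest for s in scale]   # arbitrary origin: lowest stimulus at 0

# Hypothetical "true" scale values, used only to manufacture consistent proportions.
true_values = [0.0, 0.5, 1.0]
phi = NormalDist().cdf
proportions = [[phi(sj - si) for sj in true_values] for si in true_values]
recovered = case_v_scale(proportions)

# If every judge agrees, a proportion is 0 or 1 and has no finite normal deviate.
unanimous = [[0.5, 1.0], [0.0, 0.5]]
try:
    case_v_scale(unanimous)
    unanimous_fails = False
except StatisticsError:
    unanimous_fails = True
```

With consistent proportions the hypothetical scale values are recovered up to the arbitrary origin, while the unanimous matrix cannot be scaled at all, which is the sense in which the opinions have to disagree.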

Following the advice of Albert Paul Weiss, the senior professor in
experimental psychology at Ohio State, I attended the University of Chicago 
summer school in 1929, and took a six-week course with Thurstone in which 
he covered test theory, scaling, factor analysis, and I believe, mathematical 
learning theory. A week or two was spent on each of the four topics, and 
that was that. 

A decade earlier E. L. Thorndike perceived the necessity and possibility 
for such developments in quantitative psychology: "Whatever exists at all
exists in some amount. To know it thoroughly involves knowing its quantity 
as well as its quality" (Thorndike, 1918, page 194).

Computers

Indicating the developments of the last 40 years requires mention of
the electronic computer. I was a research assistant for a year working on 
Thurstone's first study of primary mental abilities. The computational work 
in resolving a battery of about 50 tests into seven primary mental abilities 
meant that I was supervising a group of about 20 computer clerical workers 
for about a year. I recall Thurstone lamenting that his Ph.D. candidates 
would not be able to do factor analysis dissertations because it would not 
be practical to employ such a crew for each Ph.D. thesis. A few years ago, 
a research worker in the Civil Service in Washington, D. C., wanted some help
in analyzing a set of attitude scales which he had given to different types 
of persons working under Civil Service, in order to see how the jobs could be 
changed to make them more attractive. He came up one afternoon, with his 
data on punched cards, and we started about 4:00 in the afternoon to run the
preliminary error detecting program and we corrected the cards whenever errors 
were found. In all, including the scaling, correlations, factor analyses,
and rotations, although the job was somewhat larger than the primary mental
abilities one, we were finished about 3:00 the next morning. 

Testing 

During the past 40-odd years we have come a long way, as you all know,
since the publication of Thurstone’s (1931c) first test theory text. In
reliability theory the widely used K-R 20 and K-R 21 were developed (Kuder &
Richardson, 1937); we have now progressed to Kristof’s (1972) reliability
for vector variables.

Latent class and latent structure models have been developed by 
Birnbaum, Lazarsfeld, and Bert Green among others. Rasch has presented a 
theory for a one-factor ratio scale in testing. Mel Novick, Charles Lewis,
and others have worked on Bayesian procedures. These and other developments 
have been presented and summarized by Lord and Novick (1968) in Statistical 
Theories of Mental Test Scores. The theory and practical applications of
tailored testing are being investigated by Lord (1971), Cronbach and Gleser
(1965), and others.

The foregoing developments, however, are still largely in the theoretical 
field. I hope they will have greater impact on the standard aptitude and 
achievement tests. As far as I am aware, there has been little or no impact 
on teacher constructed tests used in grading classes. Instead, there is 
today what seems to be a serious movement away from any type of measurement 
in education, rather than an attempt to use better measurement methods. 

In aptitude test development, the rule is to take validity coefficients 
seriously without raising the question: "Should this validity be high or 
low?" During World War II, while working on aptitude and achievement test
development for the Navy, Norman Frederiksen and I obtained considerable 
experience in this area. At the Gunners' Mates school, we found that the 
validity of the reading test was high, and the mechanical knowledge and 
mechanical comprehension tests had low validity. We worked for about six 
months and developed identification and performance tests that measured the 
objectives given to us by the Gunners' Mates school. On the basis of grades 
on the new achievement testing program, the validity of the reading test 
took a nose dive, and the mechanical comprehension and mechanical knowledge 
went up. The same thing happened in basic engineering, where the arithmetic 
test showed highest validity initially. Nicholas Fattu worked for a year 
developing gauges to measure the products quickly and accurately, and, on the 
basis of the achievement measures, the validity of the arithmetic test dropped 
and that of the mechanical aptitude tests went up. Similar results were 
obtained in the Torpedoman's and other schools. Some of this work has been 
written up in Stuit (1947), especially chapters XII, XIII, and XV.

I think school and college grades are in need of similar scrutiny. For 
example, the spatial relations test of the College Board showed good validity 
for grades in some engineering drawing classes and poor validity for other 
engineering drawing grades. Such results would be expected if these 
courses, which were all given the same name (engineering drawing), were
in fact quite different, and were graded on different bases. A general 
discussion of such problems under the title of "Intrinsic Validity" can 
be found in Gulliksen (1950).

A paper by Plotkin in the March 1972 issue of the American Psychologist
discusses problems in the area of the validity of tests brought to the fore
by the Equal Employment Opportunities Act of 1964.

I feel sure that Ben Shimberg and his associates, who are working at 
ETS on such things as tests for auto mechanics, will not accept the con- 
clusion that the major important quality for an automobile repairman is 
high verbal ability, on the basis of validity studies, but will see to it 
that the criteria are changed so that the important abilities are mechanical 
skill, trouble shooting ability, etc. 

Some years ago, in looking over validity coefficients for the Differential 
Aptitude Test, I noticed that for one school the best predictor of grades in 
Latin was the clerical test (.47). For the other tests of the Differential
Aptitude Test, the correlations with Latin grades ranged from a low of -.37 
for mechanical reasoning through -.02 for verbal reasoning, to a high of .19 
for sentences. It was pleasing to note that this was not generally true for 
all the schools studied. But it would be even more pleasing, if some steps 
had been taken to alter the teaching and grading procedures in that school. 
Other studies by the Psychological Corporation showed that higher educational 
level goes with higher clerical ability (see Bennett, Seashore, & Wesman,
1959, pp. 48, 79; 1966, pp. 5-42).

In 1939, Truman Kelley wrote on "Mental Factors of No Importance," noting
that only the verbal and quantitative abilities seem to be important as far as 
academic work in schools and universities is concerned. He spoke of the 
numerous abilities which even then were being isolated by factor analysis 
and indicated his fear that "many of the factors thus far 'found' approach
pretty close to the limit of no importance." This may well be true of many
of the 150 or so factors in the French (1951) monograph, but before reaching 
any such conclusions, the school's teaching, testing, and grading procedures 
should be studied carefully and revised where necessary. My own judgment
would be that when this is done properly we will find that verbal and 
quantitative do not exhaust the list of useful abilities. 

In 1901, Clark Wissler, while getting his Ph.D. with James McKeen Cattell 
at Columbia, investigated the validity of a number of tests for predicting 
grades at Columbia. The validities ranged from -.02 for reaction time to 
.19 for logical memory. During the seven decades since then, aptitude and 
standardized achievement tests have advanced tremendously. However, I think 
the evidence is that college grades are now about the same as they were at 
the turn of the century. Wissler (1901) reported that the correlations of
grades ranged from a low of .30 for Rhetoric and French to a high of .75 
for Latin and Greek, which I believe would be very similar to the correlations 
obtained today. 

During recent decades there have been a few attempts at improving the 
quality of college exams and grades. An example is the work of the examining 
office at the University of Chicago during the 1930's and 1940's under the
direction of L. L. Thurstone and Ralph Tyler. The Chicago faculty later 
abandoned this program. 

The great need we have now is not for the improvement of aptitude tests,
but for improvement in the criteria against which they are evaluated, including 
not only grades for four years in college, but activities and achievements 
during the 40 years after college.

Scaling



In the early 1930's, Thurstone's interests changed from scaling to factor 
analysis. For the next decade development in the scaling area was very slow. 
Marion Richardson and I asked Gale Young and Alston Householder about the 
problem of determining dimensionality and a coordinate system for a set of 
points from the interpoint distances. They solved the problem and published 
the Young and Householder paper on multidimensional scaling, "A Discussion 
of a Set of Points in Terms of Their Mutual Distances," in 1938. The appli- 
cations of this method by Marion Richardson and Klingberg appeared at about 
the same time. Otherwise, not much appeared until Torgerson developed the
theory and verified a section of the Munsell color system in his thesis in 
1951. Messick (1956) applied the method to attitude scales a little later. 
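
The problem put to Young and Householder, finding the dimensionality and a coordinate system from interpoint distances alone, can be sketched as follows. This is a minimal sketch that places the origin at the centroid (the double-centering form usually associated with Torgerson), which is an assumption and not necessarily the form used in the 1938 paper. The function names, the Jacobi eigenvalue routine, and the five example points are likewise illustrative only.

```python
import math

def jacobi_eigen(a, sweeps=100, tol=1e-24):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix,
    found by cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):                      # rotate columns p, q of a and v
                    for k in range(n):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):                    # rotate rows p, q of a
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v

def coordinates_from_distances(d, tol=1e-8):
    """Return (dimensionality, coordinates) for a matrix of interpoint distances."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    grand = sum(row) / n
    # Double centering turns squared distances into scalar products about the centroid.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
         for i in range(n)]
    values, vectors = jacobi_eigen(b)
    # The number of clearly positive eigenvalues is the dimensionality.
    keep = sorted((k for k in range(n) if values[k] > tol), key=lambda k: -values[k])
    coords = [[vectors[i][k] * math.sqrt(values[k]) for k in keep] for i in range(n)]
    return len(keep), coords

# Hypothetical example: five points that really lie in a plane.
points = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 1)]
distances = [[math.dist(a, b) for b in points] for a in points]
n_dims, coords = coordinates_from_distances(distances)
```

The recovered coordinates differ from the original ones by a rotation, reflection, or shift of origin, but they reproduce the same interpoint distances, and the count of positive eigenvalues gives the dimensionality.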

Since then the development of scaling theory has been tremendous: the law
of categorical judgment, the method of successive intervals, and statistical
tests for fit of data to theory. These developments are presented in Torgerson's
(1958) Theory and Methods of Scaling. Since then there has been the extension
to take care of individual differences, which makes the methods far more 
useful in attitude measurement and other applications in social psychology 
(see Carroll & Chang, 1970; Helm & Tucker, 1962; Tucker, 1972; Tucker & Messick, 
1963). Luce and Tukey (1964) have developed the theory of conjoint measurement
and shown the independent foundation on which psychological measurement rests. 
Bock and Jones (1968) have given rigorous estimation procedures. The theory 
has been developed by Suppes and Zinnes, Tversky and others, so that measure- 
ments from psychological scaling are not dependent on other methods (see 
Luce, Bush, & Galanter, 1963, 1965). Indow and his group in Japan, Ekman 
and his group in Sweden, and Stevens in the United States have been active
in the development of theory and applications of scaling.

Applications of scaling techniques to linguistics and free recall are 
illustrated by the work of John B. Carroll (1971) and Friendly (1972). My own
work (Gulliksen & Gulliksen, 1971) has also illustrated the application of scaling
and factor techniques to attitudes toward work and leisure in cross-cultural
comparisons. Coombs (1964) and his co-workers have developed a nonmetric
multidimensional unfolding procedure, using this method to study confusions of Morse
code signals, and showed two dimensions in the subset of 10 signals studied.

While listening to the Psychometric Society papers here today, I was 
strongly reminded of Stephen Leacock's (1911) energetic young lord who flung
himself on his horse and rode off in all directions at once. 

As indicated above there have been numerous applications of scaling techniques
by research workers in various academic university settings. However,
when we consider the various applied fields in which linear and multidimensional
scaling could be used, the picture is different from the development of theory
and applications in the academic setting. The various polling organizations
report nothing but total percentages, sometimes broken down by various preformed
categories, such as education, sex, rural-urban, etc. I have never seen a single 
instance where a factor analysis of a set of observations is given, so that 
various points of view, or clusters of opinions, can be found. Bob Tryon (1955) 
reported a factor analysis of voting areas around San Francisco, and found 
that the various indices available formed a three-factor system. He suggested
that voting might well be associated with these factors, so that one
would get better prediction by repeating such a study in connection with a
new election poll and using these factors as independent variables in
adjusting the polling results. As far as I am aware, no polling group has
paid any attention to such possibilities. 

Green and Carmone (1970) and Green and Rao (1972) have shown how multidimensional
procedures could give valuable information in consumer surveys.
Applications in behavioral sciences have been given in Shepard, Romney and
Nerlove (1972). Applications in marketing research have been presented by
Bass, King and Pessemier (1968). The use of scaling methods in studies in
perception by Carroll and others is reported in Carterette and Friedman
(1973). Bell Laboratories has compiled a bibliography of recent studies
and applications of multidimensional scaling (Harris, 1972). It is pleasing 
to note that the scaling techniques have recently been utilized in a number 
of research areas and applied fields. 

Factor Analysis 

With respect to factor analysis, the initial papers presenting the principal
components and other methods were published by Thurstone (1931a, b;
1933) and Hotelling (1933), following earlier work by Spearman and Holzinger.
Thurstone presented his problem to Bliss in mathematics and Bartky in statistics 
one noon at the Chicago Quadrangle Club. He explained that he had a square 
symmetric array of numbers and wanted to express it in terms of summed products 
of a smaller array. Their reaction was, "Oh, you mean the square root of a
symmetric matrix." In this way Thurstone learned that matrix theory existed 
and was relevant to the factor problem, so he embarked on a year or two of 
tutoring and published The Vectors of Mind (1935), followed later by Multiple
Factor Analysis (1947), giving a concise summary of the crucial aspects of
matrix theory and their use in factor analysis. Prior to the advent of
electronic computers, approximations such as the centroid method, with largest
correlations used as initial estimates of communalities, were widely used
because of their practicality.

I remember Sam Wilks remonstrating about this. He said, "We know a
good method, the principal components, based on least squares, that gives a 
best fitting reduced rank matrix. Why can't you use that instead of these 
ad hoc approximations whose properties are unknown?" 
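
Thurstone's request (a symmetric array expressed as summed products of a smaller array) and Wilks's least-squares remark can both be illustrated with a small sketch. It extracts principal components one at a time by power iteration and deflation, which is simply a convenient self-contained choice here and is not claimed to be the procedure any of the people mentioned used. The function name, the loadings, and the reduced correlation matrix built from them are hypothetical.

```python
import math

def principal_factors(r, n_factors, iterations=500):
    """Return F (n rows, n_factors columns) such that F F' is the least-squares
    best approximation of that rank to the symmetric matrix r.

    Assumes r is positive semidefinite with well-separated leading eigenvalues."""
    n = len(r)
    resid = [row[:] for row in r]
    columns = []
    for _ in range(n_factors):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iterations):                 # power iteration
            w = [sum(resid[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:                         # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * resid[i][j] * v[j] for i in range(n) for j in range(n))
        col = [math.sqrt(max(lam, 0.0)) * x for x in v]
        columns.append(col)
        for i in range(n):                          # deflate: remove this factor
            for j in range(n):
                resid[i][j] -= col[i] * col[j]
    return [[columns[k][i] for k in range(n_factors)] for i in range(n)]

# Hypothetical loadings of four tests on two factors, and the reduced
# correlation matrix (communalities on the diagonal) that they imply.
loadings = [[0.8, 0.3], [0.7, 0.4], [0.6, -0.5], [0.5, -0.6]]
R = [[sum(a * b for a, b in zip(ri, rj)) for rj in loadings] for ri in loadings]

factors = principal_factors(R, 2)
reproduced = [[sum(a * b for a, b in zip(fi, fj)) for fj in factors] for fi in factors]
max_error = max(abs(reproduced[i][j] - R[i][j]) for i in range(4) for j in range(4))
```

Here a 4 x 2 array reproduces the 4 x 4 symmetric matrix. The recovered columns are a rotated version of the hypothetical loadings, which is why rotation remains a separate step in factor analysis.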

Since then Lawley, and Jöreskog (1970) with Gruvaeus and van Thillo
have presented the theory and associated practical computer methods.
Kaiser's (1970) little jiffy is very widely used. Tucker has given us the
procedures for double centered matrices (Tucker, 1956), for the inter-battery
matrix (Tucker, 1958) and for three and multi-mode analysis (Tucker,
1966a). Harris (1962, 1963) has presented relations among factor theories
and cautions that should be observed when attempting to measure change.
Guttman (1971) has presented some extensions and applications of his facet
theory. Horst (1961) has given possible applications of generalized 
canonical correlations. McDonald (1963, 1967) has given us his nonlinear 
factor analysis. Arbuckle (1970) has developed a procedure using the 
Toeplitz matrix as the error matrix instead of a diagonal matrix, so that 
the factor procedures may be applied to matrices where it is reasonable to 
assume "stationary error," as in analyzing nerve potentials. 

Factor analysis theory, and associated electronic computer programs,
are in a reasonably well developed state. As to applications of the methods, 
it has been mentioned previously that numerous aptitude test batteries have 
been analyzed, so that psychologists have some reasonable notions regarding 
the basic abilities represented in aptitude tests. 
The factor methods have, however, been only sketchily used in other
fields where they would be extremely valuable. Harman (196?) devotes
about a page of his text to indicating applications in economics, sociology,
physiology, etc., but the impact of these factor studies on the
fields indicated has been minimal.

Schiffman and Falkenberg (1968) have presented an interesting study of 
matrices with stimuli designating rows, by retinal cells designating columns; 
or stimuli by taste neurones, that give interesting pictures of the structure 
of these sensory systems. In the study of retinal cells, the stimuli spread 
out in a curve — violet, blue, green, yellow, orange, and red--while in the 
same space the retinal cells clustered three in the blue area, four in the 
green, and four in the red. For taste a definite three-dimensional structure 
was obtained, but the details are not so clearly interpretable. 

Memory is another field in which factor analysis would be extremely 
valuable. Paul Kelley's (1964) study, for example, demonstrated that memory
span is a factor that includes visual and auditory material, as well as 
nonsense and meaningful material. However, when one deals with longer lists, 
so that it takes a number of repetitions to memorize them, then rote memory 
differentiates from memory for meaningful material. That is to say, when 
Ebbinghaus introduced the nonsense syllable, as he thought to simply control 
for the irrelevant factor of possible differences in previous associations, 
he was unwittingly shifting to measurement of a different ability. Recently 
there has been great emphasis on what is termed "free recall," which means
that the material, though meaningfully organized, is presented randomly, to 
see the extent to which the subjects will make use of the organization in 
recall. Again, as with Ebbinghaus, it is assumed that this is simply another 
interesting procedure for tapping the memory function. However, as far as
I am aware, no factor studies have been made including both the free recall, 
the rote, and the meaningful memory where the order of presentation must 
be the order of recall. We do not know whether the free recall ability is 
the same as the previously established rote or meaningful memory, or whether 
a new ability has been introduced with this new procedure. Stake (1961) 
has evidence indicating the possibility that the change from free study 
such as Ebbinghaus (1885) used, to the memory drum, introduced by Müller and
Schumann (1894) merely as an added experimental control, may have altered
the ability being measured. 

Eight indices of "excitatory potential" (a useful hypothetical construct) 
were used by Lloyd Humphreys (1943) in a conditioned eyelid experiment. He
found two factors. Acquisition amplitude and extinction amplitude loaded 
on one factor, while acquisition and extinction latencies loaded on a 
different factor, along with extinction frequency. Acquisition frequency 
loaded equally on both factors. That is to say, the experimenter's selection 
of one or another from a set of possible indices may really be changing the 
hypothetical construct being measured. 

For decades it has been asserted that "intelligence is the ability to 
learn" (e.g., Binet, 1909, especially p. 146; Buckingham, 1921, especially
p. 273; Dearborn, 1921; Peterson, 1926, especially pp. 268 & 276; Pyle,
1921). This view that intelligence is the ability to learn has been critically
examined (see Woodrow, 1946; Peterson, 1926; Simrall, 1947, for example).
Psychologists have devised numerous clever tasks in rote learning, meaningful 
learning, concept learning, motor learning, etc., such as mirror drawing, 
pursuit rotor, reversal learning, etc., which require an ability to learn
that can be measured by time, errors or trials taken to reach some criterion,
or by parameters of some learning curve fitted to the data. In only a few
cases have such studies also included a few of the standard test scores that 
may be related to intelligence. 

Studies in this area by Duncanson, Stake, Allison, Manley, Games, and 
Bunderson, reviewed in Bob Gagne’s (1967) Learning and Individual Differences,
have indicated that there are a number of different abilities represented by 
the different learning tasks, as well as a number of different abilities 
represented by the intelligence or ability tests. So far some of these 
abilities seem to be unique to measures of learning, or to test scores; but 
there are some factors that have loadings on both the learning scores and 
the test scores. A clear answer as to the relation between aptitudes as 
measured by tests, and learning abilities as measured by various learning 
tasks devised by psychologists, is not available at present. The topic is 
in need of much further research. 

For the last 40 years there has been a profusion of factor analyses of
batteries of aptitude tests, but there are numerous other areas in learning, 
memory, physiology, nerve potentials, economics, political science, sociology, 
etc., where factor analysis would be extremely valuable, and where only a few
studies have been made. 

Learning 

With respect to mathematical learning theory, Thurstone (1930a) presented 
a mathematical derivation of an equation of the learning curve based on an 
urn analogy, which turned out to also be derivable from Thorndike’s Law of
Effect (Gulliksen, 1934). Thurstone (1930b) also showed how this theory could
be applied to determining a functional relationship between learning time
and length of task, and to separating learning ability of the individual 
from the difficulty of the task, using factor methods. I presented an 
analytical procedure that separated learning ability from initial performance 
(Gulliksen, 1942).

During the 1950's a variety of learning models were presented based on
stimulus sampling ideas (Estes), on stochastic processes (Bush & Mosteller,
1955), on stepwise increases or decreases in strength of correct and incorrect
responses (Audley & Jonckheere, 1956) and various models suggested by Bower, 
Trabasso, Atkinson, Suppes, and others. 

In general these more recent models tended to have two characteristics.

(1) Response strengths, or response probabilities, changed by finite 
amounts with each trial. The substitution of differentials for deltas was 
believed to be an extremely inappropriate step that must be avoided. 

(2) In order to obtain good parameter estimates, it was usually assumed 
that all subjects in a group could be regarded as giving estimates of the 
same parameter values so that the record of the group of learners was
analyzed to determine one set of parameters.

There are several questions introduced here that it seems to me should 
be subjected to careful experimental investigation, rather than being settled
by assumption. 

(1) We now have a variety of stochastic, or finite step models, and also 
older continuous models. Both of these types of models should be tried out 
on various types of learning data. It is perfectly possible that different 
theories will be best for different types of tasks. The same type of theory
may well not fit conditioned escape response, maze learning, visual shape
discrimination (with attention to transposition), paw retraction to avoid a
shock, conditioned emotional reactions such as rapid breathing, etc. For 
example, there is evidence that conditioned paw retraction does not transfer 
from the right to the left brain in split brain animals, while increased 
breathing rate transfers very rapidly. We need now a large number of studies 
in which various stochastic and continuous models are tried out on various 
types of learning data. 

(2) I feel that the primary stress should be on using learning parameters 
that are psychologically meaningful. By this I mean parameters such as diffi- 
culty of task, learning ability, initial preference, and final performance 
on the task. These parameters seem to me to be meaningful in understanding 
differences between learning tasks, and differences between individuals in 
learning these tasks. Parameters such as number and length of runs of errors, 
average and variance of learning parameters, number of alternations, that have 
been frequently or usually used with stochastic models, seem to me to be param- 
eters selected because they fit with the stochastic models, rather than because 
they have any interesting psychological significance in understanding the learn- 
ing process or the differences between learning tasks and between learners. 

Bush and Mosteller (Chapter 15 in Bush & Estes, 1959) have given a
comparison of eight different learning models, with respect to their agreement
with Solomon and Wynne's data on shock avoidance by 30 dogs given 25
trials each. The comparisons are entirely in terms of means and standard 
deviations of distributions of a number of variables, such as number of 
trials before first and second avoidance, total number of shocks, number of 
alternations, number of trials before the first run of four avoidances, etc.
It is assumed either that the basic parameters are the same for all animals,
or else that the parameter varies according to some specified distribution.
This seems to me to be an approach dictated basically by the characteristics
of the stochastic approach, rather than by the psychologically interesting 
properties of learners and learning tasks. Determination of parameters for 
each learner offers a much better way of understanding the learning process, 
in terms of parameters of individuals, and parameters of the tasks, such as 
initial ability and learning ability, and difficulty of the task. 

(3) Merrell (1931), Sidman (1952), and Estes (1956) pointed out the
difficulties involved in using group or average learning curves, yet 
obtaining a single set of parameters for the average learning curve is 
still a very usual procedure. Is it legitimate to regard learning parameters 
as the same for all subjects in a group, or do some subjects have definitely 
better learning ability, or initial performance than others do? Again the 
answer may be different with different types of learning problems and with 
variations in difficulty of problem. There are at least two possible 
approaches that should be tried on this problem of individual differences in 
learning parameters. One approach proposed, and tried out on some sets of 
data by Tucker (1966b) and by Weitzman (1963), is a principal components
analysis of a matrix of learning curves. The method gives a set of k
learning parameters for each individual, and k generalized, or master
learning curves. In the special case where k is equal to one, then it is 
legitimate to use the group or average learning curve. 
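
The difficulty with group curves can be seen in a small hypothetical case. The sketch below assumes five all-or-none learners, each of whom moves from never correct to always correct on a different trial. The learners, the trial numbers, and the function names are invented for illustration and are not data from Merrell, Sidman, Estes, Tucker, or Weitzman.

```python
def step_curve(jump_trial, n_trials):
    """Probability correct on each trial for a learner who masters the task
    all at once on trial jump_trial (trials are numbered from 1)."""
    return [0.0 if t < jump_trial else 1.0 for t in range(1, n_trials + 1)]

def mean_curve(curves):
    """Trial-by-trial average over learners: the usual 'group' learning curve."""
    return [sum(column) / len(curves) for column in zip(*curves)]

# Five hypothetical learners who differ only in when the jump occurs.
learners = [step_curve(jump, 10) for jump in (2, 4, 6, 8, 10)]
group = mean_curve(learners)

for trial, value in enumerate(group, start=1):
    print(trial, value)
```

Every individual record here is a single abrupt step, yet the group curve rises gradually across the ten trials, so parameters fitted to it would describe none of the learners. These five curves are not multiples of one master curve, so in the terms used above more than one component would be needed.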

(4) Parameter estimation for individual learning curves for the finite
step models is a difficult problem. Procedures have been devised by Ramsay
(1970), Wainer (1968), and Best (1966), for parameter estimation for individual
learning curves. Using Monte Carlo data with known parameters, Ramsay (1970)
found that the input parameters were not recovered except for the limited 
case of only two parameters, initial probability of a correct response, and 
the effect of reward of a correct response on the strength of the correct 
response. Ramsay also felt that negative parameters, which allowed a 
decrease in response strength, should not be permitted because this might 
lead to negative response probabilities. Best's (1966) procedure allowed 
for the possibility that the strength of the incorrect response would be 
decreased by punishment for an error. If the fitting problem is satis- 
factorily solved, then various stochastic and continuous models could be 
compared with respect to parameter determination for individual rather than 
group learning curves. 
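
The logic of a Monte Carlo recovery check of this kind can be sketched as
follows. This is not Ramsay's model or procedure. It assumes a simple
two-parameter linear-operator model, with initial probability of a correct
response p0 and a reward effect theta, fitted to each simulated learner by
grid-search maximum likelihood. All names, parameter values, and grids are
hypothetical.

```python
import math
import numpy as np


def simulate_learner(p0, theta, n_trials, rng):
    """Simulate 0/1 responses; each rewarded correct response shrinks the
    error probability by the factor (1 - theta)."""
    q = 1.0 - p0                        # current probability of an error
    responses = []
    for _ in range(n_trials):
        correct = rng.random() < 1.0 - q
        responses.append(int(correct))
        if correct:
            q *= 1.0 - theta
    return responses


def log_likelihood(responses, p0, theta):
    """Log-likelihood of one learner's response sequence under the same model."""
    q = 1.0 - p0
    total = 0.0
    for r in responses:
        total += math.log1p(-q) if r else math.log(q)
        if r:
            q *= 1.0 - theta
    return total


def fit_individual(responses, p0_grid, theta_grid):
    """Grid-search maximum likelihood estimate of (p0, theta) for one learner."""
    candidates = [(log_likelihood(responses, p0, th), p0, th)
                  for p0 in p0_grid for th in theta_grid]
    return max(candidates, key=lambda c: c[0])   # (log-likelihood, p0, theta)


rng = np.random.default_rng(0)
true_p0, true_theta = 0.3, 0.2          # known input parameters (hypothetical)
p0_grid = np.linspace(0.1, 0.9, 9)
theta_grid = np.linspace(0.05, 0.5, 10)

records = [simulate_learner(true_p0, true_theta, 40, rng) for _ in range(50)]
fits = [fit_individual(r, p0_grid, theta_grid) for r in records]
estimates = np.array([[p0, th] for _, p0, th in fits])

print("mean estimate (p0, theta):", estimates.mean(axis=0))
print("spread (s.d.)            :", estimates.std(axis=0))
```

Comparing the mean and spread of the individual estimates with the known
input values indicates how well parameters are recovered from a single
learner's record of a given length.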

(5) One of the great handicaps in the study of learning has been the 
impossibility of obtaining evidence on reliability by replication. When a 
learning curve has been obtained, a second one on the same problem for the 
same individual is impossible, because he already knows the solution, and 
cannot learn it again. If one tries a different problem, there is the question 
of how similar the two problems are, and also the question of positive or 
negative transfer. If one tries the same problem with another individual, 
then there are the possibilities of different learning abilities for the 
different individuals. Sperry (1961, esp. p. 1753; 1964, esp. p. 48) felt 
that his work with split brain preparations offered opportunity for replication 
from left to right brain with what was essentially a duplicate subject. So 
far the evidence here is conflicting. Meikle, Sechzer and Stellar (1962), 
working with cats suspended in a harness and learning to lift the front paw 
to avoid a shock signalled by stroking the shoulder, found (for three animals) 
a very good linear relation between number of trials to criterion for right 
and left brain learning. Phil Best (1966) analyzed visual discrimination data 
from an experiment by Meikle and Sechzer (1960) and found a strong linear 
relation between first and second side learning in split brain cats. By 
contrast Ian Steele Russell (Russell & Kleinman, 1970), working with 
functionally split brain rats on a conditioned escape response, found that 
for a given difficulty of problem there was a zero correlation between 
trials to criterion for left vs. right brain. He points out that this is 
consistent with the view of learning as a finite step process. Recent work 
by me and Voneida also found marked dissimilarity between right and left 
brain learning in split brain cats. 

Another problem is raised by the probability learning situation. When, 
for example, one stimulus is rewarded 70% of the time and the other rewarded 
30% of the time, some investigators report that the subjects choose the 
stimuli about 70% and 30% respectively. This behavior is known as "matching" 
and would result in (.7 x .7) plus (.3 x .3) equals .58 success. Choosing 
the "70%" stimulus all of the time would result in 70% success. This is 
known as maximizing behavior. Stimulus sampling theory predicts matching. 
However, maximizing is frequently found. Wainer (1968) found that maximizing 
behavior was the rule, and succeeded in modifying the stimulus sampling theory 
so that with different parameter values it would predict either maximizing 
or matching. His data gave good agreement with the generalized theory and 
showed maximizing rather than matching. He devised methods of fitting 
parameters to individual curves. 

Richard Rose (Rose, Beach, & Peterson, 1971) at the University of 
Washington has recently reviewed the major studies in this field and concludes 
that "probability matching though widely accepted by psychologists is not 
found when individual records are examined, instead of group averages. The 
individual response probabilities are much further away from matching or 
other theoretical values than would be permitted by the most generous inter- 
pretation of extant theories." In my view this points to the desirability 
of estimating parameters for individual rather than group curves. 
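
The success rates quoted for matching and maximizing, and the way a group
average can resemble matching when no individual does, can be checked with a
short calculation. The sketch assumes that exactly one of the two stimuli is
rewarded on each trial. The function name and the four individual choice
rates are hypothetical, chosen only for illustration.

```python
def expected_success(choice_prob, reward_prob):
    """Probability of a rewarded choice when the first stimulus is chosen with
    probability choice_prob and is the rewarded one with probability
    reward_prob (the other stimulus is rewarded otherwise)."""
    return choice_prob * reward_prob + (1 - choice_prob) * (1 - reward_prob)


matching = expected_success(0.7, 0.7)     # (.7 x .7) + (.3 x .3), about .58
maximizing = expected_success(1.0, 0.7)   # always choose the "70%" stimulus: .70

# Hypothetical individual choice rates: two maximizers, two near-chance subjects.
individual_rates = [1.0, 1.0, 0.4, 0.4]
group_rate = sum(individual_rates) / len(individual_rates)

print(round(matching, 2), round(maximizing, 2), round(group_rate, 2))
```

Here the group average choice rate is about .70, which looks like matching,
although no individual in the hypothetical group chooses at that rate.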

In the field of mathematical learning theory, it seems to me that a great 
deal of work still needs to be done on parameter estimation for individual 
learning curves and in comparing various stochastic and continuous models. 

By contrast, in the fields of test theory, scaling, and factor theory, the 
theory, including parameter estimation, significance testing, and variance 
components analysis procedures, is reasonably well developed. Test theory, 
though adequately utilized in standardized testing programs, has not yet had 
much impact on the teacher constructed tests and on grading procedures. Re- 
cently scaling, especially multidimensional scaling, has received considerable 
attention from workers in certain applied fields and in some research areas, 
but its use could be more widely extended, as for example in election polls. 
Numerous batteries of aptitude tests have been factor analyzed — but application 
of factor analysis to economics, sociology, physiology, etc. is just beginning 
to get under way. 

Quantitative psychology, which could be reasonably adequately covered 
by a six weeks' course in 1929, has moved a long way in the directions 
indicated by Thurstone (1937) in his Dartmouth address as retiring first 
president of the Psychometric Society, "Psychology as a Quantitative 
Rational Science." 

Presenting the topic, "Psychometrics — whence and whither," to this 
group is carrying coals to Newcastle, or maybe it is even gilding the lily, 
as you prefer. Much has been of necessity omitted in this brief presenta- 
tion of a 40-year history. I have presented a few of the highlights, as I 
see them, and would be especially interested in your reactions. 